Abstract

The study of the distribution of volumes associated to the internal representations of
learning examples allows us to derive the critical learning capacity
($\alpha_c = \frac{16}{\pi}\sqrt{\ln K}$) of large committee machines, to verify the stability
of the solution in the limit of a large number $K$ of hidden units and to find a Bayesian
generalization cross-over at $\alpha = K$.
PACS Numbers : 05.20 - 64.60 - 87.10 
I. INTRODUCTION



Following the approach presented in refs. [1,?], we derive the learning behaviour of
non-overlapping committee machines with a large number $K$ of hidden units. The aim of the
paper is to clarify some of the analytical aspects of a method which is based on the internal
degrees of freedom of MultiLayer Networks (MLN) and which requires a double analytic
continuation. Such an approach, besides allowing for the derivation of new results both for the
learning and the generalization behaviour of MLN, makes a rigorous bridge between different
fields in the theory of neural computation, such as Information Theory, VC-dimension and
Bayesian rule extraction, and statistical mechanics [?]. Moreover, it sheds new light on the
role of internal representations of the learning examples by relating them to the
distribution of domains of solutions and pure states in the weight space of the network.

The method consists in a generalization of the well known Gardner approach [?]. While
the latter studies the typical volume of couplings associated to the overall input-output
map implemented by the network, here we consider the decomposition of such a volume into
a macroscopic number of single volumes associated to all possible internal representations
compatible with the learned examples. In addition to the interaction weights, we also take as
dynamical variables the internal state variables of the MLN characterizing the internal
representations. For the storage problem, we are therefore interested in counting the typical
number $\exp(N\mathcal{N}_D)$ of volumes giving the dominant contribution to Gardner's volume
and in comparing it with the total number $\exp(N\mathcal{N}_R)$ of non-empty volumes. At the
learning transition, i.e. when Gardner's total volume shrinks to zero and no more patterns can
be learned without errors, we expect both entropies $\mathcal{N}_D$ and $\mathcal{N}_R$ to
vanish. The vanishing condition on $\mathcal{N}_D$ and $\mathcal{N}_R$ thus gives an
alternative indication of the storage performance of the studied network.

The generalization properties of the network, i.e. the rule inference capability from a
given set of deterministic input-output examples, also depend on the geometrical structure
of the weight space. As we shall discuss in the sequel, the internal representation approach
can be straightforwardly extended to the study of the generalization error of MLN in the
Bayesian framework [?], allowing for a geometrical interpretation of the generalization
transition together with a clarification of the role of the VC-dimension [?].

The method discussed here for the committee machine can be straightforwardly extended
to other non-overlapping MLN with arbitrary decoder functions. For instance, one may show
through the stability analysis that the RS solution $\alpha_c = \ln K/\ln 2$ for the parity
machine is an exact result in the limit $K \gg 1$. Such a result coincides with the one derived
in [5] following the standard Gardner [?] approach with one step of Replica Symmetry Breaking.
Moreover, one may also reproduce the known results on the parity machine generalization
transition [?] by means of a detailed geometrical interpretation.

The paper is organized as follows. In Sec. II we outline the basic points of our approach,
both for the learning and for the Bayesian generalization problems. In Sec. III and Sec. IV we
study the entropies $\mathcal{N}_R$ and $\mathcal{N}_D$ in the $K \gg 1$ limit and compute the
closed expression for the critical capacity. The detailed analysis of the stability of the
solution is given in Sec. V. Finally, in Sec. VI, we derive the entropy (for $K \gg 1$) of the
internal representations contributing to the Bayesian entropy. This allows us to analyze the
generalization transition of the committee machine and to explain why the VC-dimension is not
relevant for its typical generalization properties.



II. THE INTERNAL REPRESENTATION VOLUMES APPROACH 

We consider tree-like committee machines composed of $K$ non-overlapping perceptrons
with real-valued weights $J_{\ell i}$ and connected to $K$ sets of independent inputs
$\xi_{\ell i}$ ($\ell = 1, \dots, K$; $i = 1, \dots, N/K$). Committee machines are
characterized by an output $\sigma$ which is a binary function
$f(\{\tau_\ell\}) = \mathrm{sign}(\sum_\ell \tau_\ell)$ of the cells
$\tau_\ell = \mathrm{sign}(\sum_i J_{\ell i}\xi_{\ell i})$ in the hidden layer. We refer
to the set $\{\tau_\ell\}$ as the internal representation of the input pattern
$\{\xi_{\ell i}\}$. Given a macroscopic set of $P = \alpha N$ binary unbiased patterns (the
training set), the learning problem consists in finding a suitable set of internal
representations $T = \{\tau_\ell^\mu\}$ with a corresponding non-zero volume

$$ V_T = \int \prod_{\ell,i} dJ_{\ell i} \prod_{\mu} \theta\Big(\sigma^\mu f(\{\tau_\ell^\mu\})\Big) \prod_{\mu,\ell} \theta\Big(\tau_\ell^\mu \sum_i J_{\ell i}\,\xi_{\ell i}^\mu\Big) , \qquad \int \prod_{\ell,i} dJ_{\ell i} = 1 , \tag{1} $$

where $\theta(\dots)$ is the Heaviside function.
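
As a concrete illustration of the architecture just defined, the following minimal Python sketch evaluates the internal representation and the output of a tree committee machine for one input pattern. The function names and the toy sizes are illustrative choices, not notation from the paper, and $K$ is assumed odd so that the majority vote is never tied.

```python
import random


def sign(x):
    """Return +1 for x >= 0 and -1 otherwise (ties are broken towards +1)."""
    return 1 if x >= 0 else -1


def internal_representation(J, xi):
    """tau_l = sign(sum_i J[l][i] * xi[l][i]) for every hidden unit l."""
    return [sign(sum(j * x for j, x in zip(J_l, xi_l))) for J_l, xi_l in zip(J, xi)]


def committee_output(tau):
    """Majority vote f({tau_l}) = sign(sum_l tau_l); K is assumed odd."""
    return sign(sum(tau))


if __name__ == "__main__":
    rng = random.Random(0)
    K, n_per_unit = 3, 5  # N = K * n_per_unit inputs in total (toy values)
    J = [[rng.gauss(0.0, 1.0) for _ in range(n_per_unit)] for _ in range(K)]
    xi = [[rng.choice((-1, 1)) for _ in range(n_per_unit)] for _ in range(K)]
    tau = internal_representation(J, xi)
    print("internal representation:", tau, "output:", committee_output(tau))
```

Each hidden unit only sees its own block of inputs, which is what makes the receptive fields non-overlapping.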

The total volume of the weight space available for learning, i.e. Gardner's total volume,
is given by $V_G = \sum_T V_T$. We are interested in discussing the limit
$\overline{\ln V_G} \to -\infty$, which defines the maximal possible size of the training set,
i.e. the critical capacity of the model. The bar denotes the average over the patterns and
their corresponding outputs which, as usual, are drawn according to the binary unbiased
distribution law.

As discussed in [?], the partition of $V_G$ into connected components may be naturally
obtained using the volumes $V_T$ associated to the internal representations. This allows one
to give a geometrical interpretation of the learning and Bayesian generalization processes in
terms of the characteristics of the volumes dominating the overall distribution.
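
The decomposition of Gardner's volume into volumes labelled by internal representations can be illustrated by brute force on a very small machine. The sketch below is only a toy Monte Carlo estimate under stated assumptions: weight vectors are drawn with a uniform direction for each hidden unit, all sizes are hypothetical, and the function names are not taken from the paper. It keeps the samples that reproduce the desired outputs and groups them by the internal representation they induce on the whole training set.

```python
import random
from collections import Counter


def sign(x):
    return 1 if x >= 0 else -1


def sample_weights(rng, K, n):
    """One Gaussian vector per hidden unit; only its direction matters for sign()."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(K)]


def representation(J, patterns):
    """Internal representation T = {tau_l^mu} of the whole training set, as a tuple."""
    return tuple(
        tuple(sign(sum(j * x for j, x in zip(J_l, xi_l))) for J_l, xi_l in zip(J, xi))
        for xi in patterns
    )


def estimate_volumes(patterns, outputs, K, n, samples, rng):
    """Estimate each V_T as a fraction of weight space; the sum estimates V_G."""
    counts = Counter()
    for _ in range(samples):
        T = representation(sample_weights(rng, K, n), patterns)
        # Keep the sample only if every pattern gets its desired output.
        if all(sign(sum(tau)) == s for tau, s in zip(T, outputs)):
            counts[T] += 1
    return {T: c / samples for T, c in counts.items()}


if __name__ == "__main__":
    rng = random.Random(1)
    K, n, P = 3, 4, 3  # toy sizes, far from the thermodynamic limit
    patterns = [[[rng.choice((-1, 1)) for _ in range(n)] for _ in range(K)] for _ in range(P)]
    outputs = [1] * P
    V = estimate_volumes(patterns, outputs, K, n, 20000, rng)
    print(len(V), "non-empty volumes; estimated V_G =", sum(V.values()))
```

Such sampling is only feasible well below capacity, where the volumes are not exponentially small; it is meant to make the bookkeeping explicit, not to replace the replica calculation.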

Following the standard statistical mechanics approach, we first compute

$$ g(r) = -\frac{1}{Nr}\, \overline{\ln\Big(\sum_T V_T^{\,r}\Big)} \tag{2} $$

and next derive the entropy $\mathcal{N}(w)$ of the volumes $V_T$ whose sizes are equal to
$w = \frac{1}{N}\ln V_T$. This can be done using the Legendre relations
$w_r = -\partial (r g(r))/\partial r$ and $\mathcal{N}(w_r) = -\partial g(r)/\partial (1/r)$.
In contrast to standard replica calculations, here we deal with two analytic continuations: we
have $r$ blocks of $n$ replicas and, once the average over the quenched patterns has been
carried out for integer $r$ and $n$, we perform an analytic continuation to real values of $r$
and $n$.
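
The Legendre construction can be checked numerically on a toy example. In the sketch below the entropy curve is a hypothetical parabola chosen purely for illustration, $g(r)$ is taken in its saddle-point form $-\frac{1}{r}\max_w[\mathcal{N}(w) + r w]$, and the relations are assumed in the form $w_r = -\partial(r g)/\partial r$ and $\mathcal{N}(w_r) = r^2\,\partial g/\partial r$, with $w$ measured per degree of freedom. None of the numerical values comes from the paper.

```python
def toy_entropy(w, s0=0.5, w0=-1.0, sigma=0.3):
    """Hypothetical entropy curve N(w): a parabola, used only for illustration."""
    return s0 - (w - w0) ** 2 / (2.0 * sigma ** 2)


def g(r, grid_size=50001, lo=-5.0, hi=3.0):
    """g(r) = -(1/r) * max_w [N(w) + r * w], evaluated on a uniform grid in w."""
    step = (hi - lo) / (grid_size - 1)
    best = max(toy_entropy(lo + k * step) + r * (lo + k * step) for k in range(grid_size))
    return -best / r


def legendre(r, h=1e-3):
    """Return (w_r, N(w_r)) from central finite differences of g."""
    g_plus, g_minus = g(r + h), g(r - h)
    d_rg = ((r + h) * g_plus - (r - h) * g_minus) / (2.0 * h)
    d_g = (g_plus - g_minus) / (2.0 * h)
    w_r = -d_rg          # w_r = -d(r g)/dr
    n_r = r * r * d_g    # N(w_r) = -dg/d(1/r) = r^2 dg/dr
    return w_r, n_r


if __name__ == "__main__":
    for r in (0.5, 1.0, 2.0):
        w_r, n_r = legendre(r)
        # For the parabola the exact answers are w0 + r*sigma^2 and s0 - (r*sigma)^2 / 2.
        print(r, round(w_r, 4), round(n_r, 4), round(toy_entropy(w_r), 4))
```

Varying $r$ moves along the entropy curve: $r = 1$ selects the volumes dominating the total volume, while small $r$ gives more weight to the most numerous ones.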
Labeling blocks by $\rho, \lambda$ and replicas by $a, b$, the spin glass order parameters
read

$$ q_\ell^{a\rho,b\lambda} = \frac{K}{N} \sum_i J_{\ell i}^{a\rho}\, J_{\ell i}^{b\lambda} . \tag{3} $$

They represent the typical overlaps between weight vectors incoming onto the same hidden
unit $\ell$ ($\ell = 1, \dots, K$) and belonging to blocks $\rho, \lambda$ and replicas $a, b$.
Associated to the $q_\ell^{a\rho,b\lambda}$ there are also the conjugate Lagrange multipliers
$\hat q_\ell^{a\rho,b\lambda}$. Since hidden units are equivalent, we assume that at the saddle
point $q_\ell^{a\rho,b\lambda} = q^{a\rho,b\lambda}$ and
$\hat q_\ell^{a\rho,b\lambda} = \hat q^{a\rho,b\lambda}$ independently of $\ell$.
Then, within the replica symmetric (RS) Ansatz [?], we find

$$ g(r) = \mathop{\mathrm{Extr}}_{q,q^*} \left\{ \frac{1-r}{2r}\ln(1-q^*) - \frac{1}{2r}\ln\big(1-q^*+r(q^*-q)\big) - \frac{q}{2\big(1-q^*+r(q^*-q)\big)} - \frac{\alpha}{r}\int \prod_{\ell=1}^{K} Dx_\ell\, \ln \mathcal{H}(\{x_\ell\}) \right\} , \tag{4} $$

where

$$ \mathcal{H}(\{x_\ell\}) = \mathrm{Tr}_{\{\tau_\ell\}} \prod_{\ell=1}^{K} \int Dy_\ell\, H\!\left( \frac{y_\ell\sqrt{q^*-q} + \tau_\ell x_\ell \sqrt{q}}{\sqrt{1-q^*}} \right)^{r} . \tag{5} $$

Here, $q^*(r) = q^{a\rho,a\lambda}$ and $q(r) = q^{a\rho,b\lambda}$ are the typical overlaps
between two weight vectors corresponding to the same ($a$, $\rho \neq \lambda$) and to
different ($a \neq b$) internal representations $T$ respectively [?]. The Gaussian measure is
denoted by $Dx = \frac{dx}{\sqrt{2\pi}}\, e^{-x^2/2}$, whereas the function $H$ is defined as
$H(y) = \int_y^\infty Dx$. Since with no loss of generality the outputs $\sigma^\mu$ can be set
equal to 1, in eqn. (5) the sum $\mathrm{Tr}_{\{\tau_\ell\}}$ runs only over the internal
representations $\{\tau_\ell\}$ giving a positive output $f(\{\tau_\ell\}) = +1$.

As discussed in [?], when $N \to \infty$, $\frac{1}{N}\overline{\ln(V_G)} = -g(r=1)$ is
dominated by volumes of size $w_{r=1}$ whose corresponding entropy (i.e. the logarithm of
their number divided by $N$) is $\mathcal{N}_D = \mathcal{N}(w_{r=1})$. At the same time the
most numerous volumes are those of smaller size $w_{r=0}$, since in the limit $r \to 0$ all
the $T$ are counted irrespectively of their relative volumes. Their corresponding entropy
$\mathcal{N}_R = \mathcal{N}(w_{r=0})$ is the (normalized) logarithm of the total number
of implementable internal representations. The quantities $\mathcal{N}_D$ and $\mathcal{N}_R$
are easily obtained from the RS free-energy eqn. (4) using the Legendre identities. In
particular, $q(r=1)$ is the usual saddle point overlap of the Gardner volume $g(1)$ [?]. The
vanishing condition for the entropies is related to the zero volume condition for $V_G$ and
thus to the storage capacity of the models.

In the above discussion, we have focused on the storage problem. However, the
generalization properties also depend on the internal structure of the coupling space. Let us
for instance consider the case of a learnable rule, defined by a teacher network. When a
student with the same architecture is given more and more examples of the rule to infer,
its version space shrinks. In the perceptron case, the version space is simply connected
and the typical generalization error made by the student on a new example goes to zero as
its overlap with the teacher increases. The situation is much more involved in multilayer
neural networks, since the presence of separated components of the version space makes the
alignment of the student along the teacher direction more difficult. The approach we have
presented above for the learning problem may be extended to acquire a better understanding
of the generalization process in multilayer networks.
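
The Gaussian measure $Dx$ and the tail function $H$ that enter the RS free energy are easy to evaluate numerically. The sketch below is a minimal illustration with hypothetical function names and parameter values: it computes the single-hidden-unit factor $\int Dy\, H(\cdot)^r$ in the form we read for Eq. (5) and checks that at $r = 1$ it collapses to $H(x\sqrt{q/(1-q)})$, so that $q^*$ drops out. This is the reason why the $r = 1$ saddle point reproduces the usual Gardner calculation.

```python
import math


def H(y):
    """H(y) = integral of Dx from y to infinity, via the complementary error function."""
    return 0.5 * math.erfc(y / math.sqrt(2.0))


def gauss_average(func, n=4000, cutoff=8.0):
    """Integral of func(y) against Dy = exp(-y^2/2) dy / sqrt(2 pi), trapezoidal rule."""
    step = 2.0 * cutoff / n
    total = 0.0
    for k in range(n + 1):
        y = -cutoff + k * step
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * func(y) * math.exp(-0.5 * y * y)
    return total * step / math.sqrt(2.0 * math.pi)


def block_factor(x, tau, q, q_star, r):
    """Single-unit factor: int Dy H((y sqrt(q*-q) + tau x sqrt(q)) / sqrt(1-q*))^r."""
    def integrand(y):
        arg = (y * math.sqrt(q_star - q) + tau * x * math.sqrt(q)) / math.sqrt(1.0 - q_star)
        return H(arg) ** r
    return gauss_average(integrand)


if __name__ == "__main__":
    x, q, q_star = 0.7, 0.3, 0.6  # arbitrary test values with 0 < q < q* < 1
    print(block_factor(x, 1, q, q_star, 1.0), H(x * math.sqrt(q / (1.0 - q))))
```

For $r \neq 1$ the factor depends on both overlaps, which is how the distribution of volume sizes enters the calculation.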

We shall restrict ourselves to the Bayesian framework, where all teachers are drawn according
to their a priori probabilities. The generalization properties are derived through the
Bayesian entropy

$$ S_G = -\frac{1}{N} \sum_{\{\sigma^\mu\}} \overline{V_G \ln V_G} , \tag{6} $$

where the sum runs over all $2^P$ sets of possible outputs. If we now intend to look at the
distribution of the sizes of the internal representation volumes $V_T$, we have to consider
the generating free-energy $s(r)$.

[Equation (7), the definition of $s(r)$, is illegible in the source.]

To compute $s(r)$ with the replica method, we have to introduce $1 + nr$ replicas and send
$n \to 0$ at the end of the computation. The order parameters entering the computation are
the overlaps $p^{a\rho}$ between the teacher and the $nr$ students and the overlaps
$q^{a\rho,b\lambda}$ between two different students. Within the RS Ansatz, we assume that
$p^{a\rho} = p$ and $q^{a\rho,b\lambda} = q$ if $a \neq b$, $q^*$ otherwise. The result we
obtain is

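$$ s(r) = \mathop{\mathrm{Extr}}_{p,q,q^*} \left\{ \frac{1-r}{2r}\ln(1-q^*) - \frac{1}{2r}\ln\big(1-q^*+r(q^*-q)\big) - \frac{q-p^2}{2\big(1-q^*+r(q^*-q)\big)} - \frac{2\alpha}{r}\int \prod_{\ell=1}^{K} Dx_\ell\; \dots\; \ln \mathcal{H}(\{x_\ell\}) \right\} , \tag{8} $$

[In Equation (8) the teacher-dependent weight inside the integral, which involves $p$ and
$\sqrt{q-p^2}$, is illegible in the source and is indicated by the dots.]
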

In the following, we shall focus on the logarithm (divided by $N$) of the number of internal
representations contributing to $S_G = -s(1)$, that is

$$ \mathcal{N}_D = \frac{\partial s}{\partial r}(r = 1) . \tag{9} $$

It can be easily verified that $p = q$ is always a saddle-point when $r = 1$: in the Bayesian
framework, the typical overlap between two students is equal to the scalar product between
the teacher and any student.



III. ANALYSIS OF $\mathcal{N}_R$ IN THE $K \gg 1$ LIMIT

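We first focus on the $r \to 0$ case to compute the typical logarithm $\mathcal{N}_R$ of the
total number of internal representations. One can check on the saddle-point equations for
$q, q^*$ that the correct scalings of the order parameters are $q = O(1)$ and
$q^* = 1 - O(r)$. In the following, we shall call $\mu = \lim_{r\to 0}[r/(1-q^*)]$. Upon
keeping the leading terms in $K$, the trace over $T$ in equation (5) becomes

[Equation (10) is illegible in the source.]

where

[Equation (11) is illegible in the source.]

and

$$ Q_1 = \frac{1}{\sqrt{K}} \sum_{\ell=1}^{K} \frac{B(x_\ell) - B(-x_\ell)}{A(x_\ell)} . \tag{12} $$

In the above expressions we have adopted the definitions

$$ A(x) = 1 + \frac{1}{\sqrt{1+\mu(1-q)}} \exp\left( -\frac{\mu q x^2}{2(1+\mu(1-q))} \right) \tag{13} $$

and

$$ B(x) = H\left( x\sqrt{\frac{q}{1-q}} \right) + \frac{1}{\sqrt{1+\mu(1-q)}} \exp\left( -\frac{\mu q x^2}{2(1+\mu(1-q))} \right) H\left( -\frac{x\sqrt{q}}{\sqrt{1+\mu(1-q)}\,\sqrt{1-q}} \right) . \tag{14} $$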

In the $K \to \infty$ limit, $Q_1$ becomes a Gaussian variable with zero mean and variance
$Q_2 = \int Dx\, (B(x) - B(-x))^2 / A(x)^2$. The free-energy for $r \to 0$ then reads
$-G(q,\mu)/r + O(\ln r)$, where

[Equation (15), the explicit expression of $G(q,\mu)$, is illegible in the source.]

and the typical logarithm $\mathcal{N}_R$ of the total number of internal representations is
simply the maximum of $G(q,\mu)$ over $q$ and $\mu$. Taking the scaling relation
$\mu = m K^2$ (which can be inferred from the equation $\partial G/\partial\mu = 0$), and
$q = O(1)$, one finds

$$ Q_2 = \frac{2}{\pi} \arcsin(q) . \tag{16} $$
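
The arcsin form of Eq. (16) is the one produced by a standard Gaussian integral, $\int Dx\,[1 - 2H(x\sqrt{q/(1-q)})]^2 = \frac{2}{\pi}\arcsin(q)$. The sketch below checks this identity numerically. Identifying the integrand with the large-$\mu$ limit of the ratio entering $Q_2$ is an assumption made for the illustration, and the function names are hypothetical.

```python
import math


def H(y):
    """Gaussian tail H(y) = integral of Dx from y to infinity."""
    return 0.5 * math.erfc(y / math.sqrt(2.0))


def gauss_average(func, n=4000, cutoff=8.0):
    """Integral of func(x) against Dx = exp(-x^2/2) dx / sqrt(2 pi), trapezoidal rule."""
    step = 2.0 * cutoff / n
    total = 0.0
    for k in range(n + 1):
        x = -cutoff + k * step
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * func(x) * math.exp(-0.5 * x * x)
    return total * step / math.sqrt(2.0 * math.pi)


def q2_limit(q):
    """Assumed limiting form of the variance: int Dx [1 - 2 H(x sqrt(q/(1-q)))]^2."""
    c = math.sqrt(q / (1.0 - q))
    return gauss_average(lambda x: (1.0 - 2.0 * H(c * x)) ** 2)


if __name__ == "__main__":
    for q in (0.1, 0.5, 0.9):
        print(q, q2_limit(q), 2.0 / math.pi * math.asin(q))
```

As $q \to 1$ the variance approaches 1 with a square-root singularity, $1 - Q_2 \simeq \frac{2}{\pi}\sqrt{2(1-q)}$, which is what makes the energetic term sensitive to small values of $1 - q$.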

Finally, defining $q = 1 - \epsilon$ and taking the saddle point equations with respect to $m$
(which implies $m = \alpha^2$) and $\epsilon$, one finds the following result

$$ \mathcal{N}_R = \ln(K) - \frac{\pi^2\alpha^2}{256} + O(\ln(\alpha)) , \tag{17} $$

which vanishes at

$$ \alpha_R = \frac{16}{\pi}\sqrt{\ln(K)} . \tag{18} $$



IV. ANALYSIS OF $\mathcal{N}_D$ IN THE $K \gg 1$ LIMIT

We shall now concentrate on the $r \to 1$ case, which corresponds to the internal
representations giving the dominant contribution to the Gardner volume $V_G$. The typical
logarithm of the number of such internal representations is $\mathcal{N}_D$. Before taking the
limit $K \to \infty$, the Legendre transform of expression (4) gives

[Equation (19), the expression of $\mathcal{N}_D$ at finite $K$, is illegible in the source.]

where $\ln V_G$ is the replica symmetric expression of the Gardner volume. When $K$ is large,
the trace over all allowed internal representations may be evaluated as in the previous
section [?]. We find the following scalings for the order parameters

[Equation (20) is only partly legible in the source: it gives $q \simeq 1 - 128/(\pi^2\alpha^2)$
together with a scaling for $q^*$ that involves the constant $\Gamma$.]

where

[Equation (21), which defines $\Gamma$ through an integral of $H(u)\ln H(u)$ over $u$, is
illegible in the source.]


for large $K$ and $\alpha$. Therefore, the asymptotic expression of the entropy of
contributing internal representations is

$$ \mathcal{N}_D = \ln(K) - \frac{\pi^2\alpha^2}{256} + O(\ln(\alpha)) , \tag{22} $$

which vanishes at

$$ \alpha_D = \frac{16}{\pi}\sqrt{\ln(K)} . \tag{23} $$


For large $K$, $\alpha_D$ coincides with $\alpha_R$. Reasonably, we expect $\alpha_c$ to be
equal to these critical values and to scale as $\frac{16}{\pi}\sqrt{\ln(K)}$ too.
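
To get a feeling for the orders of magnitude, the short sketch below tabulates the asymptotic capacity $\frac{16}{\pi}\sqrt{\ln K}$ of the committee machine next to the parity machine value $\ln K/\ln 2$ quoted in the Introduction. Both formulas hold for $K \gg 1$ only, so the numbers for moderate $K$ are indicative at best; the function names are illustrative.

```python
import math


def alpha_committee(K):
    """Asymptotic (large K) storage capacity of the committee machine."""
    return 16.0 / math.pi * math.sqrt(math.log(K))


def alpha_parity(K):
    """Asymptotic (large K) storage capacity of the parity machine."""
    return math.log(K) / math.log(2.0)


if __name__ == "__main__":
    print("K, committee, parity")
    for K in (10, 100, 1000, 10**6):
        print(K, round(alpha_committee(K), 3), round(alpha_parity(K), 3))
```

The committee machine capacity grows only as the square root of $\ln K$, hence far more slowly than the value $\alpha = K$ at which the generalization cross-over takes place.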

V. STABILITY ANALYSIS 

In order to show that our RS calculation of $\mathcal{N}_R$ is asymptotically correct when the
number of hidden units $K$ is large, we have checked its local stability with respect to
fluctuations of the order parameter matrices. Although a full treatment would require a
complete analysis of the eigenvalues of the Hessian matrix, we have focused only on the
replicons 011 and 122 in the notations of [?], which are usually the most "dangerous" modes.
For a free-energy functional depending only on one order parameter matrix
$q^{a\rho,b\lambda}$, the corresponding eigenvalues $\Lambda_{011}$ and $\Lambda_{122}$,

[Equations (24) and (25), which express $\Lambda_{011}$ and $\Lambda_{122}$ as linear
combinations of second derivatives of the free-energy with respect to the elements
$q^{a\rho,b\lambda}$, are illegible in the source.]

are given by formula (41) in ref. [?]. In our case, however, the free-energy depends upon the
$2K$ matrices $\{Q_\ell, \hat Q_\ell\}$. According to [?], the stability condition for each
mode reads

[Equation (26) is only partly legible in the source: it is a condition of the form
$\Delta(\alpha, K) < 0$, where $\Delta$ combines three eigenvalues, one of them weighted by a
factor $K - 1$.]

where the three eigenvalues are those computed for the fluctuations with respect to
$Q_\ell Q_\ell$, $\hat Q_\ell \hat Q_\ell$ and $Q_\ell Q_m$ ($\ell \neq m$) respectively. Since
we are interested in the stability of the saddle-point giving $\mathcal{N}_R$, we focus on the
limit $r \to 0$. In this case, the correct scalings of the order parameters are
$q^* = 1 - r/\mu + O(r^2)$, $q = O(1)$. For the (011) mode, we find



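[Equation (27), which gives the three (011) eigenvalues in terms of the quantities $N_0$,
$N_1$, ... and $D$ defined below, is illegible in the source.]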
at leading order when $r \ll 1$. The quantities entering (27) are

[Equations (28) and (29), which define these quantities as traces over the internal
representations $\{\tau_\ell\}$ of products of the functions $B(\tau_\ell x_\ell)$, are
illegible in the source.]

in which we have set $X \equiv X(\mu, q) = 1 + \mu(1-q)$ and where

[Equation (30), which defines the auxiliary functions $Y_i$, is illegible in the source.]

In the large $K$ limit, the asymptotic expressions of the order parameters are
$\mu \simeq \alpha^2 K^2$ and $q \simeq 1 - 128/(\pi^2\alpha^2)$. Using the previous
expressions of the three (011) eigenvalues, we have

[Equation (31), the asymptotic expression of $\Delta_{011}(\alpha, K)$, is illegible in the
source.]

when $K \gg 1$ and $\alpha \gg 1$. Therefore, our RS solution is unstable against 011 replicon
fluctuations. However, in the large $K$ limit, $\Delta_{011}$ vanishes and the RS Ansatz
becomes marginally stable.

Let us now analyse the (122) mode. Similar calculations lead to

[Equation (32), which gives the three (122) eigenvalues, is illegible in the source.]

where

[Equation (33), which defines the additional trace over $\{\tau_\ell\}$ entering (32), is
illegible in the source.]

Therefore, we obtain

[Equation (34), the asymptotic expression of $\Delta_{122}$, is illegible in the source.]

when $K \gg 1$ and $\alpha = O(1)$. We notice that the 122 mode is always stable, and a unique
order parameter $q^*$ is thus sufficient to describe the volume associated to a set of
internal representations $T$.



VI. ANALYSIS OF $\mathcal{N}_D$ IN THE $K \gg 1$ LIMIT



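Let us now turn to the generalization problem. The Bayesian entropy $-s(r = 1)$ is given by

$$ S_G = \mathop{\mathrm{Extr}}_{q} \left\{ \frac{q}{2} + \frac{1}{2}\ln(1-q) + 2\alpha \int \prod_{\ell=1}^{K} Dx_\ell\, \mathcal{H}(\{x_\ell\}) \ln \mathcal{H}(\{x_\ell\}) \right\} , \tag{35} $$

where, as $r = 1$, $\mathcal{H}(\{x_\ell\})$ depends on $q$ only (5). In the large $K$ limit,
the scaling of $q$ is

[Equation (36), which expresses $1 - q$ in terms of $\Gamma$ and $\alpha$, is illegible in the
source.]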

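where $\Gamma$ has been defined in (21). Therefore, the Bayesian entropy asymptotically scales
as $2\ln\alpha$ and the generalization error decreases as
$\epsilon_g \simeq \sqrt{2}\,\Gamma/\alpha$. This proves that, contrary to the parity machine
case [?], only a small fraction among the $2^P$ possible sets of outputs contribute to $S_G$,
and explains why the generalization curve is smooth around $\alpha_c \sim \sqrt{\ln K}$ [?],
which is defined by an average over all sets of outputs. The typical entropy of internal
representations is given by

$$ \mathcal{N}_D = \mathop{\mathrm{Extr}}_{q,q^*} \left\{ -\frac{1}{2}\ln(1-q^*) + \frac{1}{2}\ln(1-q) + \frac{1}{2}(q - q^*) + 2\alpha \int \prod_{\ell=1}^{K} Dx_\ell\, \mathcal{H}(\{x_\ell\}) \ln \mathcal{H}(\{x_\ell\}) - \dots \right\} . \tag{37} $$

[The last term of Equation (37) is truncated in the source and is indicated by the dots.]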

In the limit $1 \ll \alpha \ll K$, the internal overlap $q^*$ scales as

[Equation (38) is illegible in the source.]

and the entropy of contributing internal representations reads

$$ \mathcal{N}_D = \ln K - \ln \alpha . \tag{39} $$

Therefore, $\alpha_c$, defined for the storage problem, and more generally the
Vapnik-Chervonenkis dimension [?], are not relevant for the typical generalization properties
of a large committee machine inferring a learnable rule. We can moreover note that above
$\alpha \simeq K$, one single domain survives and the generalization error asymptotically
decreases as $\epsilon_g = \Gamma/\alpha$, as it does for finite $K$ and large $\alpha$ [?].
Finally, the condition $q = q^*$, signaling that a unique volume is non-empty, gives back the
same estimate of the cross-over value.
VII. CONCLUSION



In this paper we have developed a complete analysis of the learning and generalization
properties of large committee machines. Our approach, in which the weight space is
partitioned according to the internal representations of the learning examples, allows us to
derive the relevant entropies $\mathcal{N}_R$, $\mathcal{N}_D$ and the Bayesian counterpart of
$\mathcal{N}_D$, and successively to find the storage capacity of the model, to verify the
stability of the solution and to study the rule inference capability. In ref. [1] we have
discussed the physical and geometrical issues arising in the application of such a method to
the learning and generalization theory of MLN. Here the chief results are the explicit
derivation of the asymptotic storage capacity, together with a detailed analysis of the
stability of the solution, and the derivation of the generalization cross-over at
$\alpha = K$. From a methodological point of view, it is interesting to note that the RS
computation of the distribution of volumes is very close to the one-step calculation of the
Gardner volume. However, it is technically simpler and allows for instance the derivation of
the asymptotic storage capacity of large committee machines, while the same quantity seemed
out of reach using the standard RSB computation [?].